Abstract

We describe our approach to the construction and evaluation of a large-scale database
called “CatVar” which contains categorial variations of English lexemes. Due to the
prevalence of cross-language categorial variation in multilingual applications, our
categorial-variation resource may serve as an integral part of a diverse range of natural
language applications. Thus, the research reported herein overlaps heavily with that of
the machine-translation, lexicon-construction, and information-retrieval communities. We
apply the information-retrieval metrics of precision and recall to evaluate the accuracy
and coverage of our database with respect to a human-produced gold standard. This
evaluation reveals that the categorial database achieves a high degree of precision and
recall. Additionally, we demonstrate that the database improves on the linkability of the
Porter stemmer by over 30%.

1  Introduction 

Natural Language Processing (NLP) applications can only be as good as the resources
upon which they rely. Resources specifying the relations among lexical items, such as
WordNet (Fellbaum, 1998) and HowNet (Dong, 2000), among others, have been used
effectively in many NLP systems.


In this paper we introduce a new resource called CatVar which specifies the lexical
relation Categorial Variation on a large scale for English. This resource has already
been used effectively in a wide range of monolingual and multilingual NLP applications.
Upon its first public release, CatVar will be freely available to the research community.
We expect that the contribution of this resource will become more widely recognized
through its future incorporation into additional NLP applications.

A categorial variation of a word with a certain part of speech is a derivationally
related word with possibly a different part of speech. For example, hunger_V, hunger_N
and hungry_AJ are categorial variations of each other, as are cross_V and across_P, and
stab_V and stab_N. Although this relation seems basic on the surface, it is critical to
work in information retrieval (IR), natural language generation (NLG) and machine
translation (MT), yet there is no large-scale resource available for English that focuses
on categorial variations.¹

In the rest of this paper we discuss other available resources and how they differ from
the CatVar database. We then describe how CatVar was built and which resources were
used. Next, we present three applications that use CatVar in different ways:
Generation-Heavy MT, headline generation, and cross-language divergence unraveling for
bilingual alignment. Finally, we present a multi-component evaluation of the database.
Our evaluation reveals that the categorial database achieves a high degree of precision
and recall and that it improves on the linkability of the Porter stemmer by over 30%.

¹ It is the intention of the WordNet 1.7 developers to include such information in their
next version, but only for nouns and verbs (Christiane Fellbaum, p.c.), not for other
pairings such as noun-adjective or verb-preposition relationships. Discussions are
currently underway for sharing the CatVar database with WordNet developers for more
rapid development, extension, and mutual validation of both resources.

2  Background 

Lexical relations describe relationships among different lexemes. According to (Cruse,
1986), lexical relations are either hierarchical taxonomic relations (such as hypernymy,
hyponymy and entailments) or non-hierarchical congruence relations (such as identity,
overlap, synonymy and antonymy).

WordNet is the most well-developed and widely used lexical database of English
(Fellbaum, 1998). In WordNet, both types of lexical relations are specified among words
with the same part of speech (verbs, nouns, adjectives and adverbs). WordNet has been
used by many researchers for different purposes ranging from the construction or
extension of knowledge bases such as SENSUS (Knight and Luk, 1994) or the Lexical
Conceptual Structure Verb Database (LVD) (Green et al., 2001) to the faking of meaning
ambiguity as part of system evaluation (Bangalore and Rambow, 2000). In the context of
these projects, one criticism of WordNet is its lack of cross-categorial links, such as
verb-noun or noun-adjective relations.

Mel'čuk approaches lexical relations by defining a lexical combinatorial zone that
specifies semantically related lexemes through Lexical Functions (LFs). These functions
define a correspondence between a key lexical item and a set of related lexical items
(Mel'čuk, 1988). There are two types of functions: paradigmatic and syntagmatic (Ramos
et al., 1994). Paradigmatic LFs associate a lexical item with related lexical items. The
relation can be semantic or syntactic. Semantic LFs include Synonym(calling) = vocation,
Antonym(small) = big, and Generic(fruit) = apple. Syntactic LFs include
Derived-Noun(expand) = expansion and Adjective(female) = feminine.

Syntagmatic LFs specify collocations with a lexeme given a specified relationship. For
example, there is an LF that returns a light verb associated with the LF's key:
Light-Verb(attention) = pay. Other LFs specify certain semantic associations such as
Intensify-Qualifier(escape) = narrow and Degradation(milk) = sour. Lexical Functions
have been used in MT and generation (e.g., Ramos et al., 1994).

Although research on Lexical Functions provides an intriguing theoretical discussion,
there are no large-scale resources available for categorial variations induced by lexical
functions. This lack of resources should not suggest that the problem is too trivial to
be worthy of investigation or that a solution would not be a significant contribution. On
the contrary, categorial variations are necessary for handling many NLP problems. For
example, in the context of MT, (Habash et al., 2002) claims that 98% of all translation
divergences (variations in how source and target languages structure meaning) involve
some form of categorial variation. Moreover, most information retrieval systems require
some way to reduce variant words to common roots to improve the ability to match
queries (Xu and Croft, 1998; Hull and Grefenstette, 1996; Krovetz, 1993).

Given the lack of large-scale resources containing categorial variations, researchers
frequently develop and use alternative algorithmic approximations of such a resource.
These approximations can be divided into Reductionist (Analytical) and Expansionist
(Generative) approximations. The former focus on the conversion of several surface forms
into a common root. Stemmers such as the Porter stemmer (Porter, 1980) are a typical
example. The latter, or expansionist approaches, overgenerate possibilities and rely on
a statistical language model to rank and select among them. The morphological generator
in Nitrogen is an example of such an approximation (Langkilde and Knight, 1998).

There are two types of problems with approximations of this type: (1) they are
uni-directional and thus limited in usability: a stemmer cannot be used for generation
and a morphological overgenerator cannot be used for stemming; (2) the crude
approximating nature of such systems causes many problems in quality and efficiency
from over-stemming/under-stemming or over-generation/under-generation.

Consider, for example, the Porter stemmer, which stems commune_N, communication_N and
communism_N to commun. And yet, it does not produce this same stem for communist_N or
communicable_AJ (stemmed to communist and communic respectively).² Another example is
the expansionist Nitrogen morphological generator, where the morphological feature
+nominalize-verb applied to develop returns eleven variations including *developage,
*developication and *developy. Only two are correct (development and developing). Such
overgeneration, multiplied out at different points in a sentence, expands the search
space exponentially, and, given various cut-offs in the search algorithm, incorrect forms
might even appear among the top-ranked choices.

Given these issues, our goal is to build a database of categorial variations that can be
used with both expansionist and reductionist approaches without the cost of
over/under-stemming/generation. The research reported herein is relevant to machine
translation, lexicon construction, and information retrieval.

First, we describe the construction of the “CatVar” database and its use in multilingual
applications. Following this, we demonstrate the application of information-retrieval
metrics of precision and recall in an evaluation of our database with respect to a
human-produced gold standard. Finally, we demonstrate that the database improves on the
linkability of the Porter stemmer by over 30%.

3  Building  the  CatVar 

The CatVar database was developed using a combination of resources and algorithms
including the LCS Verb and Preposition Databases (Dorr, 2001), the Brown Corpus section
of the Penn Treebank (Marcus et al., 1993), an English morphological analysis lexicon
developed for PC-Kimmo (Englex) (Antworth, 1990), NOMLEX (Macleod et al., 1998),
Longman Dictionary of Contemporary English (LDOCE)³ (Procter, 1983), WordNet 1.6
(Fellbaum, 1998), and the Porter stemmer (Porter, 1980). The contribution of each of
these sources is clearly labeled in the CatVar database, thus enabling the use of
different cross-sections of the resource for different applications.⁴

² For a deeper discussion and classification of the Porter stemmer's errors, see
(Krovetz, 1993).

³ An English Verb-Noun list extracted from LDOCE was provided by Rebecca Green.

⁴ For example, in a headline generation system (HeadGen), higher Bleu scores were
obtained when using the portions of the CatVar database that are most relevant to
nominalized events (e.g., NOMLEX).

Some of these resources were used to extract seed links between different words (Englex
lexicon, NOMLEX and LDOCE). Others were used to provide large-scale coverage of
lexemes. In the case of the Brown Corpus, which does not provide lexemes for its words,
the Englex morphological analyzer was used together with the part of speech specified in
the Penn Treebank to extract the lexeme form. The Porter stemmer was later used as part
of a clustering step to expand the seed links to create clusters of words that are
categorial variants of each other, e.g., hunger_N, hungry_AJ, hunger_V, hungriness_N.

The current version of CatVar (version 2.0) includes 62,232 clusters covering 96,368
unique lexemes. The lexemes belong to one of four parts of speech (Noun 62%, Adjective
24%, Verb 10% and Adverb 4%). Almost half of the clusters currently include one word
only. Three-quarters of these single-word clusters are nouns and one-fifth are
adjectives. The other half of the words is distributed in a Zipfian fashion over clusters
of size 2 to 27. Figure 1 shows the word-cluster distribution.

Figure 1: CatVar Distribution (word vs. cluster size)

A smaller supplementary database devoted to verb-preposition variations was constructed
solely from the LCS verb and preposition lexicon, using shared LCS primitives to
cluster. The database was inspired by pairs such as cross_V and across_P which are used
in Generation-Heavy MT. But since verb-preposition clusters are not typically
morphologically related, they are kept separate from the rest of the CatVar database and
they were not included in the evaluation presented in this paper.⁵
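
The paper does not spell out the clustering algorithm, so the following is only one plausible reading of the step in which seed links are expanded with Porter stems: lexemes are merged whenever they share a seed link or a stem. The stem table is a hypothetical stand-in for a real stemmer (its two stems are the ones reported for the hunger cluster), and the seed link and the `build_clusters` name are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical stand-in for a Porter stemmer: only the stems reported in the
# text for the hunger cluster are listed; other words fall back to themselves.
STEMS = {"hunger": "hunger", "hungry": "hungri", "hungriness": "hungri"}


def build_clusters(lexemes, seed_links):
    """Merge lexemes joined by a seed link or by a shared stem (union-find)."""
    parent = {lex: lex for lex in lexemes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Step 1: seed links taken from lexical resources.
    for a, b in seed_links:
        union(a, b)

    # Step 2: expand by linking every pair of lexemes that share a stem.
    by_stem = defaultdict(list)
    for lex in lexemes:
        word, _pos = lex
        by_stem[STEMS.get(word, word)].append(lex)
    for group in by_stem.values():
        for other in group[1:]:
            union(group[0], other)

    clusters = defaultdict(set)
    for lex in lexemes:
        clusters[find(lex)].add(lex)
    return list(clusters.values())


lexemes = [("hunger", "N"), ("hunger", "V"), ("hungry", "AJ"),
           ("hungriness", "N"), ("zip", "N")]
seed_links = [(("hunger", "N"), ("hungry", "AJ"))]  # assumed seed link
clusters = build_clusters(lexemes, seed_links)
```

In this toy run the single seed link bridges the two stem groups (hunger and hungri), which is the kind of connection stemming alone would miss.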

CatVar is web-browseable at http://clipdemos.umiacs.umd.edu/catvar/. Figure 2 shows the
CatVar web-based interface with the hunger cluster as an example. The interface allows
searching clusters using regular expressions as well as cluster length restrictions. The
database is also available for researchers in Perl/C and Lisp searchable formats.


Figure  2:  Web  Interface 


4  Applications 

Our project is focused on resource building and evaluation. However, the CatVar database
is relevant to a number of natural language applications, including generation for MT,
headline generation, and cross-language divergence unraveling for bilingual alignment.
Each of these is discussed below in turn.

4.1  Generation-Heavy  Machine  Translation 

The Generation-Heavy Hybrid Machine Translation (GHMT) model was introduced in
(Habash, 2002) to handle translation divergences between language pairs with
asymmetrical (poor-source/rich-target) resources. The approach does not rely on a
transfer lexicon or a common interlingual representation to map between divergent
structural configurations from source to target language. Instead, different alternative
structural configurations are overgenerated and these are statistically ranked using a
language model.

⁵ This supplementary database includes 242 clusters for more than 230 verbs and 29
prepositions. Other examples of verb-preposition clusters include: avoid_V and
away from_P; enter_V and into_P; and border_V and beside_P (or next to_P).

The CatVar database is used as one of the constraints on the structural expansion step.
For example, to allow the conflation of verbs such as make_V or cause_V and an argument
such as development_N, the first condition for conflatability is finding a verb
categorial variant of the argument development_N. In this case the verb categorial
variant is develop_V.⁶
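
As a minimal sketch of this first conflatability check, the lookup below asks whether a noun has a verb in its cluster. The in-memory cluster table, the `variants` function, and the tag strings are illustrative assumptions; they do not reflect the actual CatVar file format or the GHMT implementation.

```python
# Toy cluster table of (lexeme, part-of-speech) pairs. The two clusters use
# words mentioned in the paper; the layout itself is an assumption.
CLUSTERS = [
    {("develop", "V"), ("development", "N")},
    {("hunger", "N"), ("hunger", "V"), ("hungry", "AJ"), ("hungriness", "N")},
]


def variants(word, pos, target_pos, clusters=CLUSTERS):
    """Return words tagged target_pos that share a cluster with (word, pos)."""
    found = set()
    for cluster in clusters:
        if (word, pos) in cluster:
            found.update(
                w for w, p in cluster
                if p == target_pos and (w, p) != (word, pos)
            )
    return sorted(found)


def has_verb_variant(noun, clusters=CLUSTERS):
    """First conflatability condition: the noun argument has a verb variant."""
    return bool(variants(noun, "N", "V", clusters))


verb_forms = variants("development", "N", "V")
```

Here `verb_forms` holds the verb variant of development_N, and `has_verb_variant` returns False for any noun that is absent from the table.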

4.2  Headline  Generation 

The HeadGen headline generator was introduced in (Zajic et al., 2002) to create
headlines automatically from newspaper text. The goal is to generate an informative
headline (one that specifies the event and its participants), not just an indicative
headline (which specifies the topic only). The system is implemented as a Hidden Markov
Model enhanced with a postprocessor that filters out headlines that do not contain a
verbal or nominalized event. This is achieved by verifying that there is at least one
word in the generated headline that appears in CatVar as a V (a verbal event) or as an N
whose verbal counterpart is in the same cluster (a nominalized event).
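
The filter condition just described can be sketched as follows. This is an illustration under stated assumptions, not the HeadGen code: headline words are assumed to be already reduced to lexeme form, and the toy clusters and the `has_event` name are hypothetical.

```python
# Toy clusters of (lexeme, part-of-speech) pairs; the entries are
# illustrative and are not copied from the CatVar database.
TOY_CLUSTERS = [
    {("fight", "V"), ("fight", "N"), ("fighter", "N")},
    {("nation", "N"), ("national", "AJ")},
    {("capital", "N"), ("capital", "AJ")},
]


def has_event(headline_lexemes, clusters=TOY_CLUSTERS):
    """True if some word is a V, or an N whose cluster also contains a V."""
    for word in headline_lexemes:
        for cluster in clusters:
            tags = {pos for w, pos in cluster if w == word}
            if "V" in tags:
                return True  # verbal event
            if "N" in tags and any(pos == "V" for _, pos in cluster):
                return True  # nominalized event
    return False


kept = has_event(["washingtonian", "fight", "over", "drug"])
dropped = has_event(["in", "the", "nation", "capital"])
```

With these toy clusters the first headline passes the filter because fight is listed as a verb, while the second contains no verbal or nominalized event and would be filtered out.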

A recent study indicates that there is a significant improvement in Bleu scores (using
human-generated headlines as our references) when running headline generation with the
CatVar filter:⁷

•  HeadGen with CatVar filter: 0.1740

•  HeadGen with no CatVar filter: 0.1687

This quantitative distinction correlates with human-perceived differences, e.g., between
the two headlines "Washingtonians fight over drugs" and "In the nation's capital"
(generated for the same story with and without CatVar, respectively).

4.3  DUSTer 

DUSTer (Divergence Unraveling for Statistical Translation) was introduced in (Dorr et
al., 2002). In this system, common divergence types are systematically identified and
English sentences are transformed to bear a closer resemblance to those of another
language using a mapping referred to as E-to-E′. The objective is to enable more
accurate alignment and projection of dependency trees in another language without
requiring any training on dependency-tree data in that language.

⁶ The other conditions on conflatability and some detailed examples are discussed in
(Habash, 2002) and (Habash and Dorr, 2002).

⁷ For details about the Bleu evaluation metric, see (Papineni et al., 2002).

The CatVar database has been incorporated into two components of the DUSTer system:
(1) in the E-to-E′ mapping, e.g., the transformation from kick_V to LightVB kick_N
(corresponding to the English/Spanish divergence pair kick/dar patada); and (2) during
an automatic mark-up phase prior to this transformation, where the particular E-to-E′
mapping is selected from a set of possibilities based on the two input sentences. For
example, the rule V[CatVar=N] -> LightVB N is selected for the transformation above by
first checking that the verb V is associated with a word of category N in CatVar.
Transforming divergent English sentences using this mechanism has been shown to
facilitate word-level alignment by reducing the number of unaligned and
multiply-aligned words.

5  Evaluation 

This  section  includes  two  evaluations  concerned 
with  different  aspects  of  the  CatVar  database.  The 
first  evaluation  calculates  the  recall  and  precision  of 
CatVar’s  clustering  and  the  second  determines  the 
contribution  of  CatVar  over  Porter  Stemmer. 

5.1  CatVar Clustering Evaluation: Recall and Precision

To  determine  the  recall  and  precision  of  CatVar 
given  the  lack  of  a  gold  standard,  we  asked  8  native 
speakers  to  evaluate  400  randomly-selected  clusters. 
Each  annotator  was  given  a  set  of  100  clusters  (with 
two  annotators  per  set).  Figure  3  shows  a  segment  of 
the  evaluation  interface  which  was  web-browseable. 

Figure 3: Evaluation

The annotators were given detailed instructions and many examples to help them with the
task. They were asked to classify each word in every cluster as belonging to one of the
following categories:

•  Perfect: This word definitely belongs in this cluster.

•  Perfect (except for part of speech problem).

•  Perfect (except for spelling problem).

•  Not Sure: It is not clear whether a word that is derivationally correct belongs in a
set or not.

•  Doesn't Belong: This word doesn't belong in this cluster.

•  May not be a Real Word: This word is not known and could not be found in a
dictionary.

The interface also provided an input text box to add missing words to a cluster.

In calculating the inter-annotator agreement, we did not consider mismatches in word
additions as disagreement since some annotators could not think up as many possible
variations as others. After all, this was not an evaluation of their ability to think up
variations, but rather of the coverage of the CatVar database. Even though there were
six fine-grained classifications, the average inter-annotator agreement was high
(80.75%). Many of the disagreements, however, resulted from the fine granularity of the
options available to the annotators.

In a second calculation of inter-annotator agreement, we simplified the annotators'
choices by placing them into three groups corresponding to Perfect (Perfect and
Perfect-but), Not-sure (Not-sure and May-not-be-a-real-word) and Wrong
(Does-not-belong). This annotation-grouping approach is comparable to the clustering
techniques used by (Veronis, 1998) to “super-tag” fine-grained annotations. After
grouping the annotations, average inter-annotator agreement rose to 98.35%.
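
The grouping of the six fine-grained categories into three can be sketched as below. The paper does not state how agreement was computed, so simple percent agreement over words is assumed here; the label strings, the `agreement` function, and the two short annotation lists are invented for illustration and are not data from the study.

```python
# Mapping from the six fine-grained categories to the three groups named in
# the text. The label spellings are assumptions made for this sketch.
GROUP = {
    "perfect": "Perfect",
    "perfect_pos_problem": "Perfect",
    "perfect_spelling_problem": "Perfect",
    "not_sure": "Not-sure",
    "may_not_be_real_word": "Not-sure",
    "does_not_belong": "Wrong",
}


def agreement(labels_a, labels_b, mapping=None):
    """Fraction of words given the same label by both annotators.

    If a mapping is supplied, labels are grouped before being compared.
    Percent agreement is an assumption; the paper does not name its measure.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    if mapping is not None:
        labels_a = [mapping[label] for label in labels_a]
        labels_b = [mapping[label] for label in labels_b]
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Invented annotations for four words, for illustration only.
annotator_1 = ["perfect", "perfect_pos_problem", "not_sure", "does_not_belong"]
annotator_2 = ["perfect", "perfect", "may_not_be_real_word", "does_not_belong"]

fine_grained = agreement(annotator_1, annotator_2)
grouped = agreement(annotator_1, annotator_2, GROUP)
```

In this invented example the two annotators disagree on half of the fine-grained labels but on none of the grouped ones, which mirrors the direction of the reported rise in agreement.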

The cluster modifications produced by each pair of annotators assigned to the same
cluster were then combined automatically in an approximation to post-annotation
inter-annotator discussion, which traditionally results in agreement: (1) if both
annotators agreed on a category, then it stands; (2) one annotator overrides another in
cases where one is more sure than the other (i.e., Perfect overrides
Perfect-but-with-error/Not-sure and Wrong overrides Not-sure); (3) in cases where one
annotator considers a word Perfect while the other annotator considers it Wrong, we
compromise at Not-sure. The union of all added words was included in the combined
cluster.
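
The three combination rules can be sketched as a small function. As a simplifying assumption, the sketch works on the three grouped labels (Perfect, Not-sure, Wrong), so the Perfect versus Perfect-but-with-error case collapses into rule (1); the function names and the example annotations are hypothetical.

```python
def combine_labels(a, b):
    """Combine two annotators' grouped labels for one word."""
    if a == b:
        return a                   # rule 1: agreement stands
    pair = {a, b}
    if pair == {"Perfect", "Wrong"}:
        return "Not-sure"          # rule 3: compromise
    if "Perfect" in pair:
        return "Perfect"           # rule 2: Perfect overrides Not-sure
    return "Wrong"                 # rule 2: Wrong overrides Not-sure


def combine_cluster(labels_a, labels_b, added_a, added_b):
    """Combine per-word labels and take the union of the added words."""
    combined = {w: combine_labels(labels_a[w], labels_b[w]) for w in labels_a}
    return combined, set(added_a) | set(added_b)


# Invented annotations for one cluster, for illustration only.
labels_a = {"hunger": "Perfect", "hungry": "Not-sure", "hungriness": "Perfect"}
labels_b = {"hunger": "Perfect", "hungry": "Perfect", "hungriness": "Wrong"}
combined, added = combine_cluster(labels_a, labels_b, ["hungrily"], [])
```

Because every disagreement is resolved deterministically, the combined cluster needs no further adjudication, which is what lets the procedure stand in for a post-annotation discussion.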

The 400 combined clusters covered 808 words. 68% of the words were rated as Perfect.
None had spelling errors and only one word had a part-of-speech issue. 23 words (less
than 3%) were marked as Not-sure, and only 6 words (less than 1%) were marked as Wrong.
There were 209 added words (about 26%). However, 128 words (or 61% of missing words)
were not actually missing, but rather not linked into the set of clusters evaluated by a
particular annotator. Some of these words were clustered separately in the database.⁸
The rest of the missing words (81 words or 10% of all words) were not present in the
database, but 50 of them (or 62%) were linkable to existing words in CatVar using simple
stemming (e.g., the Porter stemmer, whose relevance is described next).

The precision was calculated as the ratio of perfect words to all original (i.e., not
added) words: 91.82%. The recall was calculated as the ratio of perfect words to all
perfect plus all added words: 72.46%. However, if we exclude the not-really-missing
words, the adjusted recall value becomes 87.16%. The harmonic mean or F-score⁹ of the
precision and recall is 81.00% (or 89.43% for adjusted recall).
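
As a quick check, the two F-scores can be recomputed from the precision and recall values reported above, using the standard harmonic-mean definition. The function name is illustrative.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the text, expressed as fractions.
PRECISION = 0.9182
RECALL = 0.7246
ADJUSTED_RECALL = 0.8716

f_plain = f_score(PRECISION, RECALL)              # about 0.8100
f_adjusted = f_score(PRECISION, ADJUSTED_RECALL)  # about 0.8943
```

Both results agree with the reported 81.00% and 89.43% to two decimal places.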

5.2  Linkability Evaluation: Comparison to Porter Stemmer

To measure the contribution of CatVar with respect to the “linking together” of related
words, we define linkability as the percentage of word-to-word links in the database
resulting from a specific source. For example, Natural linkability refers to pairs of
words whose form does not change across categories, such as zip_V and zip_N or afghan_N
and afghan_AJ. Porter linkability refers to words linkable by reduction to a common
Porter stem. CatVar linkability is the linkability of two words appearing in the same
CatVar cluster.

⁸ The 128 words that were “not really missing” were clustered in 89 other clusters not
included in the evaluation sample.

⁹ $F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

Figure 4 shows an example of all three types of links in the hunger cluster. Here,
hunger_N and hunger_V are linked in three ways: naturally (N), by the Porter stemmer
(P), and in CatVar (C). Porter links hungry_AJ and hungriness_N via the common stem
hungri, but Porter does not link either of these to hunger_N or hunger_V (stem hunger).
The total number of links in this cluster is six, two of which are Porter-determinable
and only one of which is naturally-determinable.
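
The link counts for the hunger cluster can be reproduced with a short sketch. The stem table simply records the two Porter stems reported in the text rather than running a stemmer, and the function name is illustrative.

```python
from itertools import combinations

# Porter stems as reported in the text for the hunger cluster.
PORTER_STEM = {"hunger": "hunger", "hungry": "hungri", "hungriness": "hungri"}

HUNGER_CLUSTER = [
    ("hunger", "N"), ("hunger", "V"), ("hungry", "AJ"), ("hungriness", "N"),
]


def count_links(cluster, stems=PORTER_STEM):
    """Count all pairwise links, plus those recoverable naturally or by stem."""
    total = natural = porter = 0
    for (w1, _), (w2, _) in combinations(cluster, 2):
        total += 1                       # every pair in a cluster is a CatVar link
        natural += w1 == w2              # same form, different category
        porter += stems[w1] == stems[w2] # same Porter stem
    return total, natural, porter


total, natural, porter = count_links(HUNGER_CLUSTER)
```

The counts come out as six links in total, one natural link and two Porter links, so four of the six links in this cluster are available only through CatVar.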


Figure  4:  Three  Types  of  Links 


The calculation of linkability applies only to the portion of the database containing
multi-word clusters (about half of the database) since single-word clusters have zero
links. The 48,867 linked words are distributed over 14,731 clusters with a total of
89,638 links. About 12% of these links are naturally-determinable and 70% are
Porter-linkable. The remaining 30% of the links are a significant contribution of the
CatVar database compared to the Porter stemmer, particularly since this stemmer is an
industry standard in the information retrieval community.

It is important to point out that, for CatVar to be used in IR, it must be accompanied
by an inflectional analyzer that reduces words to their lexeme form (removing plural
endings from nouns or gerund endings from verbs).¹⁰ The contribution of CatVar is in its
linking of words related derivationally, not inflectionally. Work by (Krovetz, 1993)
demonstrates improved performance with derivational stemming over the Porter stemmer
most of the time.

¹⁰ This is, in fact, the approach used in the HeadGen and DUSTer applications described
above.


6  Conclusions  and  Future  Work 

We have presented our approach to constructing and evaluating a new large-scale database
containing categorial variations of English words. In addition, we have described
different applications for which it has proven useful. Our evaluation indicates that
CatVar has coverage and accuracy of over 80% (F-score) and also that the database
improves on the linkability of the Porter stemmer by about 30%. These findings are
significant contributions to several different communities, including information
retrieval and machine translation.

Future work includes improving the word-cluster ratio and absorbing more of the
single-word clusters into existing clusters or other single-word clusters. We are also
considering enriching the clusters with types of derivational relations such as
“nominal-event” or “doer” to complement part-of-speech labels. Additionally, we are
interested in measuring the applied contribution of using CatVar in natural-language
applications. Finally, we intend to incorporate CatVar into new applications such as
parallel corpus word alignment.